Distributed Computing
#distributedenvironments #distributedsystems #distributed #clocks #network #failure
What happens when everything goes wrong? What if only #partialfailures happen?
The real problems arise only when partial failures happen, because they introduce nondeterminism into the environment, which can cause inconsistencies and unpredictable results.
There are two main types of computing environments:
- #supercomputers
- #cloudcomputing
Supercomputers prefer to escalate a partial failure into a total failure, thus keeping consistency. Partial failure is the real deal in cloud computing, where we depend on the network and real-world side effects.
The usual way to handle the possibility of a receiver not receiving or not responding to a request is a timeout. But how long should it be?
Premature timeouts: if you declare a node dead prematurely, you can overload the entire cluster. If the delay was caused by overload in the first place, killing nodes and redistributing their load may make things even worse.
Late timeouts: can cause a horrifying experience for the end user and a performance hit.
For enterprise applications in cloud environments, you need to determine the timeout experimentally: monitor it, take the context of the application logic into account, and even adjust it dynamically based on metrics. Suggestion: the Phi Accrual failure detector.
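A minimal sketch of the idea behind the Phi Accrual detector (a simplified version, assuming heartbeat intervals are roughly normally distributed — real implementations are more sophisticated): instead of a fixed timeout, it outputs a suspicion level phi based on the observed distribution of heartbeat arrival times.

```python
import math
import time

class PhiAccrualDetector:
    """Simplified sketch of a phi accrual failure detector."""

    def __init__(self, threshold=8.0):
        self.threshold = threshold   # phi above this => suspect the node
        self.intervals = []          # recent heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now if now is not None else time.monotonic()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-100:]  # sliding window
        self.last_heartbeat = now

    def phi(self, now=None):
        now = now if now is not None else time.monotonic()
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)
        elapsed = now - self.last_heartbeat
        # probability that a heartbeat arrives later than `elapsed`,
        # assuming normally distributed intervals
        p_later = 0.5 * (1 - math.erf((elapsed - mean) / (std * math.sqrt(2))))
        p_later = max(p_later, 1e-12)  # avoid log(0)
        return -math.log10(p_later)

    def is_suspect(self, now=None):
        return self.phi(now) > self.threshold
```

The nice property: the "timeout" adapts itself to the node's observed behavior, instead of being a hardcoded constant.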
#clocks
We can separate the usage of clocks and time in software into two categories:
- points in time
- duration/measure
Time always comes from the hardware, which may not be perfectly synchronized. #time-of-day clocks are not useful for measuring elapsed time, since there can be differences between machines even when they use NTP. That is a job for #monotonicclocks.
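In Python this split maps directly onto `time.time()` (time-of-day) vs `time.monotonic()` (duration); a small sketch:

```python
import time

# Time-of-day clock: for points in time (timestamps, log entries).
# NTP can step this clock backwards, so never use it to measure durations.
wall_start = time.time()

# Monotonic clock: for durations. It only moves forward and is
# unaffected by NTP adjustments.
mono_start = time.monotonic()
time.sleep(0.05)
elapsed = time.monotonic() - mono_start  # always >= 0, meaningful duration
```

The difference `time.monotonic() - mono_start` is meaningful on one machine; comparing monotonic values across machines is not.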
If your software requires high precision for time of day and synchronized clocks, you need to monitor the clock offset across all nodes; if one drifts too far, declare it dead.
One clear example is synchronization of data across replicas with #lastwritewins (#lww), which depends on the timestamp of each change.
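A toy sketch of why LWW plus skewed time-of-day clocks is dangerous (hypothetical merge function, not any particular database's implementation): the replica with the slower clock silently loses its write, even though that write happened later.

```python
# Hypothetical last-write-wins merge: each replica tags writes with its
# local time-of-day clock, and the highest timestamp wins.
def lww_merge(versions):
    """versions: list of (timestamp, value); return the winning value."""
    return max(versions, key=lambda v: v[0])[1]

# Replica B's clock is 5 seconds behind. Its write really happened
# *after* replica A's, but LWW silently discards it.
versions = [
    (1700000100.0, "write from A"),       # A's clock (accurate)
    (1700000095.0, "later write from B"), # B's clock (skewed backwards)
]
winner = lww_merge(versions)  # "write from A" -- the newer write is lost
```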
Databases such as Spanner use the timestamp to generate a transaction ID for snapshot isolation, since generating transaction IDs on distributed databases can lead to conflicts.
#clocks #GC #RTOS
#processpauses
Another critical and dangerous usage of clocks is locking processes based on timestamps (e.g., timeouts).
Partitions with single leaders often use a #lease. Each time a leader is elected, it gets a lease with an expiration time; before it expires, the leader has to renew the lease.
For example, with a single leader, the window for renewing the lease can be swallowed by a GC pause, container pausing/resuming, heavy IO, disk swap, physically closing a laptop lid, or even a UNIX SIGSTOP to the process.
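The unsafe pattern, sketched (hypothetical class and storage dict, for illustration only): the process checks the lease and then writes, but any pause between the check and the write can let the lease expire while the process still believes it holds it.

```python
import time

class NaiveLeaseHolder:
    """Unsafe pattern: check the lease, then act on it."""

    def __init__(self, lease_expiry):
        self.lease_expiry = lease_expiry  # wall-clock expiry time

    def write_if_leader(self, storage, key, value):
        if time.time() < self.lease_expiry:   # lease looks valid...
            # <-- a multi-second GC pause or SIGSTOP here invalidates
            #     the check we just made
            storage[key] = value              # ...but may no longer be
            return True
        return False
```

No amount of shrinking the gap between check and write fixes this, since the pause can be arbitrarily long; that is what fencing tokens (below in these notes) are for.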
Preventing GC pauses: the runtime can notify that the node is about to go through GC, so traffic is routed to another instance. Or simply kill the node and restart it before a long-lived-object GC kicks in.
One way to prevent expired leases from corrupting data when writing with a single leader is to add fencing tokens to the lease.
Whenever we claim a lease, we also get a token (e.g., a monotonically increasing number), and the storage service checks whether the writing node holds the most recent token.
Tip: if you use ZooKeeper as a locking service, either the zxid (transaction ID) or the node version (cversion) can be used as a fencing token.
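A sketch of the storage side of fencing (hypothetical `FencedStorage` class, assuming the lock service hands out increasing tokens): writes carrying a token older than one already seen are rejected, so a paused ex-leader with an expired lease cannot corrupt data.

```python
class FencedStorage:
    """Storage that rejects writes from stale lease holders."""

    def __init__(self):
        self.data = {}
        self.highest_token = -1

    def write(self, token, key, value):
        if token < self.highest_token:
            return False  # stale leader: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

s = FencedStorage()
s.write(33, "x", "from old leader")                 # accepted
s.write(34, "x", "from new leader")                 # accepted
s.write(33, "x", "old leader wakes from GC pause")  # rejected
```

The crucial point is that the *storage* enforces the check, not the client: the paused leader cannot be trusted to notice its own lease expired.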
Whenever we model a system, we should categorize it and assume what can and cannot happen, so we can write our algorithms accordingly.
We can categorize the models by timing as:
- synchronous, where everything behaves as expected
- partially synchronous, where it does most of the time
- asynchronous, where the algorithm cannot make any assumptions about timing
And regarding node faults, we can categorize them as:
- crash-stop faults, where a node fails only by crashing and never comes back.
- crash-recovery faults, where a node may crash and come back later; what is persisted somehow survives, so lost work can be retried.
- Byzantine faults, where nodes can even deceive our internal systems.
Algorithms must satisfy correctness criteria based on properties, which can be #safety or #liveness properties, such as uniqueness, #monotonicsequence, or #availability.
To map the real world onto a system model we need to make assumptions, and even handle edge cases that were never meant to happen, which requires human work.